
[Figure 2.4 diagram: Input → Patch Embedding → L stacked Transformer blocks (MHSA and MLP, each followed by Add & Norm) → Classifier; within MHSA, query/key/value matrix multiplications produce the attention score, the Information Rectification Module (IRM) acts on the quantized attention computation, and teacher activations feed the Distribution Guided Distillation (DGD).]

FIGURE 2.4
Overview of Q-ViT, applying the Information Rectification Module (IRM) for maximizing representation information and Distribution Guided Distillation (DGD) for accurate optimization.

inevitably deteriorates the attention module's representation capability in capturing the input's global dependency. Second, the distillation for the fully quantized ViT baseline utilizes a distillation token (following [224]) to directly supervise the quantized ViT classification output. However, we found that such a simple supervision is not effective enough: it is coarse-grained because of the large gap between the quantized attention scores and their full-precision counterparts.

To address the issues above, a fully quantized ViT (Q-ViT) [136] is developed by keeping the distribution of the quantized attention modules consistent with that of their full-precision counterparts (see the overview in Fig. 2.4). Accordingly, we propose to rectify the distorted distribution over the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization in the forward process. In the backward process, we present a distribution-guided distillation (DGD) scheme to eliminate the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart.
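To make the distillation idea concrete before the method is detailed, the following is a minimal PyTorch sketch of an attention-similarity term between a quantized student and its full-precision teacher. It is only an illustration under simplifying assumptions (a plain MSE on the attention maps, with an assumed tensor shape and function name); it is not the exact DGD loss of Q-ViT [136].

```python
import torch
import torch.nn.functional as F

def attention_similarity_loss(attn_q: torch.Tensor, attn_fp: torch.Tensor) -> torch.Tensor:
    """Illustrative attention-distillation term.

    attn_q  -- attention maps of the quantized student, shape (batch, heads, tokens, tokens)
    attn_fp -- attention maps of the full-precision teacher, same shape

    A simple MSE stand-in for the distribution-guided loss described in the text.
    """
    # Detach the teacher so gradients flow only into the quantized student.
    return F.mse_loss(attn_q, attn_fp.detach())
```

In a training loop, such a term would typically be weighted and added to the classification loss of the quantized model.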

2.3.1 Baseline of Fully Quantized ViT

First, we build a baseline to study the fully quantized ViT, since it has never been proposed in previous work. A straightforward solution is to quantize the representations (weights and activations) of the ViT architecture in the forward propagation and to apply distillation to the optimization in the backward propagation.

Quantized ViT architecture. We briefly review neural network quantization. A general asymmetric activation quantization and symmetric weight quantization scheme is formulated as

$$
\begin{aligned}
Q_a(x) &= \big\lfloor \mathrm{clip}\{(x - z)/\alpha_x,\; Q^x_n,\; Q^x_p\} \big\rceil, \qquad & \hat{x} &= Q_a(x) \times \alpha_x + z,\\
Q_w(w) &= \big\lfloor \mathrm{clip}\{w/\alpha_w,\; Q^w_n,\; Q^w_p\} \big\rceil, \qquad & \hat{w} &= Q_w(w) \times \alpha_w.
\end{aligned}
\tag{2.13}
$$

Here, $\mathrm{clip}\{y, r_1, r_2\}$ returns $y$ with values below $r_1$ set as $r_1$ and values above $r_2$ set as $r_2$, and $\lfloor y \rceil$ rounds $y$ to the nearest integer. With activations quantized to signed $a$ bits and weights to signed $b$ bits, $Q^x_n = -2^{a-1}$, $Q^x_p = 2^{a-1} - 1$ and $Q^w_n = -2^{b-1}$, $Q^w_p = 2^{b-1} - 1$.
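As a concrete illustration of Eq. (2.13), a minimal PyTorch sketch of the two quantizers is given below; the function names, the default bit-widths, and the assumption that the scales $\alpha_x$, $\alpha_w$ and zero-point $z$ are provided externally are ours, not part of the original formulation.

```python
import torch

def quantize_activation(x: torch.Tensor, alpha_x: float, z: float, a: int = 4) -> torch.Tensor:
    """Asymmetric activation quantizer of Eq. (2.13).

    alpha_x (scale) and z (zero-point) are assumed to be given,
    e.g. learned or calibrated; the bit-width default is arbitrary.
    """
    qn, qp = -(2 ** (a - 1)), 2 ** (a - 1) - 1                # Q_n^x, Q_p^x for signed a bits
    q = torch.round(torch.clamp((x - z) / alpha_x, qn, qp))   # ⌊clip{(x - z)/α_x, Q_n^x, Q_p^x}⌉
    return q * alpha_x + z                                    # de-quantized x̂

def quantize_weight(w: torch.Tensor, alpha_w: float, b: int = 4) -> torch.Tensor:
    """Symmetric weight quantizer of Eq. (2.13); alpha_w (scale) is assumed given."""
    qn, qp = -(2 ** (b - 1)), 2 ** (b - 1) - 1                # Q_n^w, Q_p^w for signed b bits
    q = torch.round(torch.clamp(w / alpha_w, qn, qp))         # ⌊clip{w/α_w, Q_n^w, Q_p^w}⌉
    return q * alpha_w                                        # de-quantized ŵ
```

During training, the non-differentiable rounding is usually handled with a straight-through estimator, which relates to the forward and backward propagation discussed next.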

In general, the forward and backward propagation of the quantization function in the quantized